{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Using *LLM-as-a-judge* 🧑‍⚖️ for an automated and versatile evaluation\n",
    "_Authored by: [Aymeric Roucher](https://huggingface.co/m-ric)_  \n",
    "_Translator: [Loïck Bourdois](https://hf.co/lbourdois)_\n",
    "\n",
    "Evaluating large language models (LLMs) is often a difficult endeavour: given their broad capabilities, the tasks given to them often have to be judged on requirements that are very broad and loosely defined. For example, an assistant's answer to a question can be:\n",
    "- not grounded in context\n",
    "- repetitive, repetitive, repetitive\n",
    "- grammatically incorrect\n",
    "- excessively lengthy and characterized by an overabundance of words, leading to a situation where the discourse or written content becomes overly detailed and protracted\n",
    "- incoherent\n",
    "- ...\n",
    "\n",
    "The list of criteria goes on and on. And even if we had a limited list, each of them would be hard to measure: \"Devising a rule-based program to assess the outputs is extremely challenging. Traditional evaluation metrics based on the similarity between outputs and reference answers (e.g., [ROUGE](https://hf.co/spaces/evaluate-metric/rouge), [BLEU](https://hf.co/spaces/evaluate-metric/bleu)) are also ineffective for these questions.\"\n",
    "\n",
    "✅ A powerful solution for assessing outputs in a human way, without requiring costly human time, is *LLM-as-a-judge* (referred to simply as the \"judge\" from here on), that is, using a second model to judge the outputs of the first.\n",
    "This method was introduced in [Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena](https://huggingface.co/papers/2306.05685), which I encourage you to read.\n",
    "\n",
    "💡 The idea is simple: ask an LLM to do the grading for you. 🤖✓\n",
    "\n",
    "But we will see that it does not work well out of the box: you need to set it up carefully to get good results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install huggingface_hub datasets pandas tqdm -q"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "import pandas as pd\n",
    "from tqdm.auto import tqdm\n",
    "from datasets import load_dataset\n",
    "from huggingface_hub import InferenceClient, notebook_login\n",
    "\n",
    "tqdm.pandas()  # load tqdm's pandas support\n",
    "pd.set_option(\"display.max_colwidth\", None)\n",
    "\n",
    "notebook_login()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'\\n\\nI’m good, thanks. I’m in the middle of a tour at the'"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
    ],
   "source": [
    "repo_id = \"mistralai/Mixtral-8x7B-Instruct-v0.1\"\n",
    "\n",
    "llm_client = InferenceClient(\n",
    "    model=repo_id,\n",
    "    timeout=120,\n",
    ")\n",
    "\n",
    "# Test your LLM client\n",
    "llm_client.text_generation(prompt=\"How are you today?\", max_new_tokens=20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that we interact with this model in English because the dataset used below is in English. In practice, `Mixtral-8x7B-Instruct-v0.1` could also be used for French.\n",
    "\n",
    "\n",
    "## 1. Prepare the creation and evaluation of our judge\n",
    "\n",
    "Let's say you want to give an LLM a specific task, such as answering open-ended questions.\n",
    "\n",
    "The difficulty is that, as discussed above, the quality of an answer is hard to measure. For example, an exact string match will flag too many correct but differently worded answers as wrong.\n",
    "\n",
    "You could ask human labellers to judge the outputs, but this would take them a lot of time, and if you want to update the model or the questions, you would have to start all over again.\n",
    "\n",
    "\n",
    "✅ In this case, you can set up a judge.\n",
    "\n",
    "**But to use a judge, you first need to evaluate how reliably it rates your model's outputs.**\n",
    "\n",
    "➡️ So the first step will be... to create a human evaluation dataset. A few human-annotated examples, as few as 30, should be enough to get a good idea of the judge's performance.\n",
    "You can reuse this dataset every time you want to test your judge.\n",
    "\n",
    "In our case, we will use [`feedbackQA`](https://huggingface.co/datasets/McGill-NLP/feedbackQA), which contains 2 human evaluations and scores for each question/answer pair. Using a sample of 30 examples will be representative of what your small evaluation dataset could look like."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ratings = load_dataset(\"McGill-NLP/feedbackQA\")[\"train\"]\n",
    "ratings = pd.DataFrame(ratings)\n",
    "\n",
    "ratings[\"review_1\"] = ratings[\"feedback\"].apply(lambda x: x[\"rating\"][0])\n",
    "ratings[\"explanation_1\"] = ratings[\"feedback\"].apply(lambda x: x[\"explanation\"][0])\n",
    "ratings[\"review_2\"] = ratings[\"feedback\"].apply(lambda x: x[\"rating\"][1])\n",
    "ratings[\"explanation_2\"] = ratings[\"feedback\"].apply(lambda x: x[\"explanation\"][1])\n",
    "ratings = ratings.drop(columns=[\"feedback\"])\n",
    "\n",
    "# Map the rating labels to numeric scores\n",
    "conversion_dict = {\"Excellent\": 4, \"Acceptable\": 3, \"Could be Improved\": 2, \"Bad\": 1}\n",
    "ratings[\"score_1\"] = ratings[\"review_1\"].map(conversion_dict)\n",
    "ratings[\"score_2\"] = ratings[\"review_2\"].map(conversion_dict)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is always a good idea to compute a performance *baseline*: here, it can be, for instance, the agreement between the two human raters, measured by the [Pearson correlation](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) of the scores they give."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Correlation between 2 human raters:\n",
      "0.563\n"
     ]
    }
   ],
   "source": [
    "print(\"Correlation between 2 human raters:\")\n",
    "print(f\"{ratings['score_1'].corr(ratings['score_2'], method='pearson'):.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This correlation between the two human raters is not very good. If your human ratings agree this poorly, it probably means the rating criteria are not clear enough.\n",
    "\n",
    "This means that our \"ground truth\" contains noise: we therefore cannot expect any algorithmic evaluation to come very close to it.\n",
    "\n",
    "However, we can reduce this noise:\n",
    "- by taking the average score as our ground truth instead of any single score, which should even out some of the irregularities.\n",
    "- by only selecting the samples on which the human raters agree.\n",
    "\n",
    "Here, we will choose the last option and **only keep the examples on which the two human raters agree**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>question</th>\n",
       "      <th>answer</th>\n",
       "      <th>review_1</th>\n",
       "      <th>explanation_1</th>\n",
       "      <th>review_2</th>\n",
       "      <th>explanation_2</th>\n",
       "      <th>score_1</th>\n",
       "      <th>score_2</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>human_score</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>What can I do to help people that are grieving?</td>\n",
       "      <td>Coping with Stress\\nTake care of yourself and your community\\nTaking care of yourself, your friends, and your family can help you cope with\\nstress. Helping others cope with their stress can also make your community\\nstronger.\\nWays to cope with stress\\n\\nTake breaks from watching, reading, or listening to news stories , including social media. Hearing about the pandemic repeatedly can be upsetting.\\nTake care of your body. \\nTake deep breaths, stretch, or meditate.\\nTry to eat healthy, well-balanced meals.\\nExercise regularly, get plenty of sleep.\\nAvoid alcohol and drugs.\\n\\n\\nMake time to unwind. Try to do some other activities you enjoy.\\nConnect with others. Talk with people you trust about your concerns and how you are feeling.\\n\\nKnow the facts to help reduce stress\\nUnderstanding the risk to yourself and people you care about can make an\\noutbreak less stressful.\\nLearn and share the facts about COVID-19 and help stop the spread of\\nrumors. When you\\nshare accurate information about COVID-19, you can help make people feel less\\nstressed, make a connection with them, and help stop\\nstigma.\\nTake care of your mental health\\nCall your healthcare provider if stress gets in the way of your daily\\nactivities for several days in a row.\\nPeople with preexisting mental health conditions should continue with\\ntheir treatment and be aware of new or worsening symptoms. Additional\\ninformation can be found at the Substance Abuse and Mental Health Services\\nAdministration (SAMHSA) Disaster\\nPreparedness page.\\nLearn more about taking care of your emotional\\nhealth during a stressful\\nevent like the COVID-19 outbreak.</td>\n",
       "      <td>Bad</td>\n",
       "      <td>The question is about others which the reply did not answer.</td>\n",
       "      <td>Bad</td>\n",
       "      <td>The response could have addressed how to help those that are grieving cope rather than what it was presenting.</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>What protocols do workplaces need to follow to keep everyone safer?</td>\n",
       "      <td>Coronavirus and Australian workplace laws\\nHealth &amp; safety in the workplace\\nWorkplaces must follow the rules about health and safety during coronavirus to\\nhelp stop it spreading. Find out more about:\\n\\nrules and obligations under workplace health and safety laws\\nhow to manage the risk of coronavirus in the workplace\\nwhere to go for help.\\n\\nLearn more about Health and safety in the workplace during\\ncoronavirus.</td>\n",
       "      <td>Could be Improved</td>\n",
       "      <td>This answer needs to be improved because it doesn’t provide information up-front about workplaces during the pandemic. Instead, it just includes a hyperlink.</td>\n",
       "      <td>Could be Improved</td>\n",
       "      <td>there is one link to information, but there is no information in the answer about how to stay safe in the workplace. it talks about the need to stay safe in the workplace, but it doesn't talk about ways in which to actually do that.</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>How soon can I apply for financial support?</td>\n",
       "      <td>COVID-19 early release of super\\nAfter you apply\\nIt will take us up to four business days to process your application and send\\nyour outcome letter to your myGov inbox. You may also receive an SMS\\nnotification.\\nIf you receive a notification from us and haven't applied to access your super\\nearly, you need to call us or your fund as soon as possible.\\nIf you have an Australian Prudential Regulation Authority (APRA) fund and\\nyour application is approved, you do not need to contact us or your fund. Your\\nfund will make the payment to you without you needing to apply to them\\ndirectly.\\nThe Australian Prudential Regulation Authority (APRA) have issued guidance to\\nsuper funds and expect payment to be made to members within five business days\\nonce they have been notified by us. However, this time may increase where\\nfunds need to contact you to clarify information. More information can be\\nfound on APRA's websiteExternal Link.\\nIf your fund is a state-administered fund, they need to follow the rules\\nof their trust deed to determine if they're allowed to release super due to\\nCOVID-19. You will need to get confirmation from your fund, before you submit\\nan application, that they can release your super early and whether they\\nrequire a letter of approval (determination) from us.\\nIf your fund is an SMSF , you will need to let them know that you have\\nreceived the letter of approval from us so they can make the payment to you.</td>\n",
       "      <td>Acceptable</td>\n",
       "      <td>There is information on how to apply for the help.  Still, there is nothing say how long you have to wait before applying.</td>\n",
       "      <td>Acceptable</td>\n",
       "      <td>This response says how long the applications take to process and then some more information about the process. There's a link to more relevant information. A pretty good answer</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Should vulnerable children be expected to be in educational settings?</td>\n",
       "      <td>Guidance Actions for schools during the coronavirus outbreak\\nPrioritising pupils\\nWhat are our expectations regarding vulnerable children and young people attending educational settings?\\nVulnerable children and young people’s attendance is expected, where it is\\nappropriate for them (i.e. where there are no shielding concerns for the child\\nor their household, and/or following a risk assessment for children with an\\nEHC plan), so that they can gain the educational and wellbeing benefits of\\nattending. Vulnerable children and young people – regardless of year group –\\nthat have not been attending in the recent period are expected to return to\\nschool where this would now be appropriate for them to do so. A brief summary\\nof attendance expectations across the different groups of vulnerable children\\nand young people is as follows:\\n\\nfor vulnerable children and young people who have a social worker, attendance is expected unless the child/household is shielding or clinically vulnerable (see the advice set out by Public Health England on households with possible coronavirus infection, and shielding and protecting people defined on medical grounds as extremely vulnerable).\\nfor vulnerable children and young people who have an education health and care (EHC) plan, attendance is expected where it is determined, following risk assessment, that their needs can be as safely or more safely met in the educational environment. 
Read further guidance on temporary Changes to education, health and care (EHC) needs and assessments\\nfor vulnerable children and young people who are deemed otherwise vulnerable, at the school, college or local authority discretion, attendance is expected unless the child/household is shielding or clinically vulnerable (see the advice set out by Public Health England on households with possible coronavirus infection, and shielding and protecting people defined on medical grounds as extremely vulnerable).\\n\\n*[EHC]: Education, Health and Care</td>\n",
       "      <td>Excellent</td>\n",
       "      <td>There is a lot of relevant information here.  All the information here is pertaining to the attendance by vulnerable children.</td>\n",
       "      <td>Excellent</td>\n",
       "      <td>This answers the questions and includes links and guides on how to help keep the kids healthy. It provides guidelines on what to do and how to bring the students back to school</td>\n",
       "      <td>4</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                                          question  \\\n",
       "human_score                                                                          \n",
       "1                                  What can I do to help people that are grieving?   \n",
       "2              What protocols do workplaces need to follow to keep everyone safer?   \n",
       "3                                      How soon can I apply for financial support?   \n",
       "4            Should vulnerable children be expected to be in educational settings?   \n",
       "\n",
       "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
    answer  \\\n",
       "human_score                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
             \n",
       "1                                                                                                                                                                                                                                                                                                                                                                           Coping with Stress\\nTake care of yourself and your community\\nTaking care of yourself, your friends, and your family can help you cope with\\nstress. Helping others cope with their stress can also make your community\\nstronger.\\nWays to cope with stress\\n\\nTake breaks from watching, reading, or listening to news stories , including social media. Hearing about the pandemic repeatedly can be upsetting.\\nTake care of your body. \\nTake deep breaths, stretch, or meditate.\\nTry to eat healthy, well-balanced meals.\\nExercise regularly, get plenty of sleep.\\nAvoid alcohol and drugs.\\n\\n\\nMake time to unwind. Try to do some other activities you enjoy.\\nConnect with others. Talk with people you trust about your concerns and how you are feeling.\\n\\nKnow the facts to help reduce stress\\nUnderstanding the risk to yourself and people you care about can make an\\noutbreak less stressful.\\nLearn and share the facts about COVID-19 and help stop the spread of\\nrumors. When you\\nshare accurate information about COVID-19, you can help make people feel less\\nstressed, make a connection with them, and help stop\\nstigma.\\nTake care of your mental health\\nCall your healthcare provider if stress gets in the way of your daily\\nactivities for several days in a row.\\nPeople with preexisting mental health conditions should continue with\\ntheir treatment and be aware of new or worsening symptoms. 
Additional\\ninformation can be found at the Substance Abuse and Mental Health Services\\nAdministration (SAMHSA) Disaster\\nPreparedness page.\\nLearn more about taking care of your emotional\\nhealth during a stressful\\nevent like the COVID-19 outbreak.   \n",
       "2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             Coronavirus and Australian workplace laws\\nHealth & safety in the workplace\\nWorkplaces must follow the rules about health and safety during coronavirus to\\nhelp stop it spreading. 
Find out more about:\\n\\nrules and obligations under workplace health and safety laws\\nhow to manage the risk of coronavirus in the workplace\\nwhere to go for help.\\n\\nLearn more about Health and safety in the workplace during\\ncoronavirus.   \n",
       "3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         COVID-19 early release of super\\nAfter you apply\\nIt will take us up to four business days to process your application and send\\nyour outcome letter to your myGov inbox. You may also receive an SMS\\nnotification.\\nIf you receive a notification from us and haven't applied to access your super\\nearly, you need to call us or your fund as soon as possible.\\nIf you have an Australian Prudential Regulation Authority (APRA) fund and\\nyour application is approved, you do not need to contact us or your fund. Your\\nfund will make the payment to you without you needing to apply to them\\ndirectly.\\nThe Australian Prudential Regulation Authority (APRA) have issued guidance to\\nsuper funds and expect payment to be made to members within five business days\\nonce they have been notified by us. However, this time may increase where\\nfunds need to contact you to clarify information. More information can be\\nfound on APRA's websiteExternal Link.\\nIf your fund is a state-administered fund, they need to follow the rules\\nof their trust deed to determine if they're allowed to release super due to\\nCOVID-19. 
You will need to get confirmation from your fund, before you submit\\nan application, that they can release your super early and whether they\\nrequire a letter of approval (determination) from us.\\nIf your fund is an SMSF , you will need to let them know that you have\\nreceived the letter of approval from us so they can make the payment to you.   \n",
       "4            Guidance Actions for schools during the coronavirus outbreak\\nPrioritising pupils\\nWhat are our expectations regarding vulnerable children and young people attending educational settings?\\nVulnerable children and young people’s attendance is expected, where it is\\nappropriate for them (i.e. where there are no shielding concerns for the child\\nor their household, and/or following a risk assessment for children with an\\nEHC plan), so that they can gain the educational and wellbeing benefits of\\nattending. Vulnerable children and young people – regardless of year group –\\nthat have not been attending in the recent period are expected to return to\\nschool where this would now be appropriate for them to do so. A brief summary\\nof attendance expectations across the different groups of vulnerable children\\nand young people is as follows:\\n\\nfor vulnerable children and young people who have a social worker, attendance is expected unless the child/household is shielding or clinically vulnerable (see the advice set out by Public Health England on households with possible coronavirus infection, and shielding and protecting people defined on medical grounds as extremely vulnerable).\\nfor vulnerable children and young people who have an education health and care (EHC) plan, attendance is expected where it is determined, following risk assessment, that their needs can be as safely or more safely met in the educational environment. 
Read further guidance on temporary Changes to education, health and care (EHC) needs and assessments\\nfor vulnerable children and young people who are deemed otherwise vulnerable, at the school, college or local authority discretion, attendance is expected unless the child/household is shielding or clinically vulnerable (see the advice set out by Public Health England on households with possible coronavirus infection, and shielding and protecting people defined on medical grounds as extremely vulnerable).\\n\\n*[EHC]: Education, Health and Care   \n",
       "\n",
       "                      review_1  \\\n",
       "human_score                      \n",
       "1                          Bad   \n",
       "2            Could be Improved   \n",
       "3                   Acceptable   \n",
       "4                    Excellent   \n",
       "\n",
       "                                                                                                                                                             explanation_1  \\\n",
       "human_score                                                                                                                                                                  \n",
       "1                                                                                                             The question is about others which the reply did not answer.   \n",
       "2            This answer needs to be improved because it doesn’t provide information up-front about workplaces during the pandemic. Instead, it just includes a hyperlink.   \n",
       "3                                               There is information on how to apply for the help.  Still, there is nothing say how long you have to wait before applying.   \n",
       "4                                           There is a lot of relevant information here.  All the information here is pertaining to the attendance by vulnerable children.   \n",
       "\n",
       "                      review_2  \\\n",
       "human_score                      \n",
       "1                          Bad   \n",
       "2            Could be Improved   \n",
       "3                   Acceptable   \n",
       "4                    Excellent   \n",
       "\n",
       "                                                                                                                                                                                                                                        explanation_2  \\\n",
       "human_score                                                                                                                                                                                                                                             \n",
       "1                                                                                                                                      The response could have addressed how to help those that are grieving cope rather than what it was presenting.   \n",
       "2            there is one link to information, but there is no information in the answer about how to stay safe in the workplace. it talks about the need to stay safe in the workplace, but it doesn't talk about ways in which to actually do that.   \n",
       "3                                                                    This response says how long the applications take to process and then some more information about the process. There's a link to more relevant information. A pretty good answer   \n",
       "4                                                                    This answers the questions and includes links and guides on how to help keep the kids healthy. It provides guidelines on what to do and how to bring the students back to school   \n",
       "\n",
       "             score_1  score_2  \n",
       "human_score                    \n",
       "1                  1        1  \n",
       "2                  2        2  \n",
       "3                  3        3  \n",
       "4                  4        4  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Examples: keep only the ratings on which both human raters agree\n",
    "ratings_where_raters_agree = ratings.loc[ratings[\"score_1\"] == ratings[\"score_2\"]]\n",
    "examples = ratings_where_raters_agree.groupby(\"score_1\").sample(7, random_state=1214)\n",
    "examples[\"human_score\"] = examples[\"score_1\"]\n",
    "\n",
    "# Show one sample for each score\n",
    "display(examples.groupby(\"human_score\").first())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Create our judge\n",
    "We build our judge from a basic prompt containing these elements:\n",
    "- a task description\n",
    "- a scale description: `minimum`, `maximum`, value type (here `float`)\n",
    "- an explanation of the output format\n",
    "- the beginning of the answer, to lead the LLM by the hand as far as possible"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "JUDGE_PROMPT = \"\"\"\n",
    "You will be given a user_question and system_answer couple.\n",
    "Your task is to provide a 'total rating' scoring how well the system_answer answers the user concerns expressed in the user_question.\n",
    "Give your answer as a float on a scale of 0 to 10, where 0 means that the system_answer is not helpful at all, and 10 means that the answer completely and helpfully addresses the question.\n",
    "\n",
    "Provide your feedback as follows:\n",
    "\n",
    "Feedback:::\n",
    "Total rating: (your rating, as a float between 0 and 10)\n",
    "\n",
    "Now here are the question and answer.\n",
    "\n",
    "Question: {question}\n",
    "Answer: {answer}\n",
    "\n",
    "Feedback:::\n",
    "Total rating: \"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## French translation of the previous cell, shown as an example prompt\n",
    "## (running this cell replaces the English JUDGE_PROMPT defined above)\n",
    "JUDGE_PROMPT = \"\"\"\n",
    "Vous recevrez un couple user_question et system_answer.\n",
    "Votre tâche consiste à donner une `note totale` indiquant dans quelle mesure la system_answer répond aux préoccupations de l'utilisateur exprimées dans la user_question.\n",
    "Donnez votre réponse sous la forme d'un flottant sur une échelle de 0 à 10, où 0 signifie que la system_answer n'est pas du tout utile, et 10 signifie que la réponse répond complètement et utilement à la question.\n",
    "\n",
    "Donnez votre avis comme suit :\n",
    "\n",
    "Avis:::\n",
    "Note totale : (votre note, sous la forme d'un flottant entre 0 et 10)\n",
    "\n",
    "Voici maintenant la question et la réponse.\n",
    "\n",
    "Question : {question}\n",
    "Réponse : {answer}\n",
    "\n",
    "Avis:::\n",
    "Note totale : \"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "examples[\"llm_judge\"] = examples.progress_apply(\n",
    "    lambda x: llm_client.text_generation(\n",
    "        prompt=JUDGE_PROMPT.format(question=x[\"question\"], answer=x[\"answer\"]),\n",
    "        max_new_tokens=1000,\n",
    "    ),\n",
    "    axis=1,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "from typing import Optional\n",
    "\n",
    "\n",
    "def extract_judge_score(answer: str, split_str: str = \"Total rating:\") -> Optional[float]:\n",
    "    \"\"\"Return the first number found after `split_str` (or in the whole answer if it is absent), else None.\"\"\"\n",
    "    try:\n",
    "        if split_str in answer:\n",
    "            rating = answer.split(split_str)[1]\n",
    "        else:\n",
    "            rating = answer\n",
    "        digit_groups = [el.strip() for el in re.findall(r\"\\d+(?:\\.\\d+)?\", rating)]\n",
    "        return float(digit_groups[0])\n",
    "    except Exception as e:\n",
    "        print(e)\n",
    "        return None\n",
    "\n",
    "\n",
    "examples[\"llm_judge_score\"] = examples[\"llm_judge\"].apply(extract_judge_score)\n",
    "# Rescale the LLM score from [0, 10] to the human raters' [1, 4] scale\n",
    "examples[\"llm_judge_score\"] = (examples[\"llm_judge_score\"] / 10) * 3 + 1"
   ]
  },
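  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the parsing and rescaling steps concrete, here is a minimal, self-contained sketch. The helper names `parse_rating` and `rescale` and the sample judge output are assumptions made for illustration; they are not part of the original notebook.\n",
    "\n",
    "```python\n",
    "import re\n",
    "from typing import Optional\n",
    "\n",
    "\n",
    "def parse_rating(text: str, marker: str = \"Total rating:\") -> Optional[float]:\n",
    "    \"\"\"Return the first number after `marker`, or the first number in `text` if the marker is absent.\"\"\"\n",
    "    tail = text.split(marker, 1)[1] if marker in text else text\n",
    "    match = re.search(r\"\\d+(?:\\.\\d+)?\", tail)\n",
    "    return float(match.group()) if match else None\n",
    "\n",
    "\n",
    "def rescale(score: float, old_min: float = 0.0, old_max: float = 10.0,\n",
    "            new_min: float = 1.0, new_max: float = 4.0) -> float:\n",
    "    \"\"\"Linearly map a score from [old_min, old_max] to [new_min, new_max].\"\"\"\n",
    "    return new_min + (score - old_min) * (new_max - new_min) / (old_max - old_min)\n",
    "\n",
    "\n",
    "raw_output = \" 7.5\\nThe answer covers most of the question.\"  # made-up judge output\n",
    "score = parse_rating(raw_output)\n",
    "print(score, rescale(score))\n",
    "```\n",
    "\n",
    "Writing the mapping as a general min-max formula keeps the scale bounds explicit, so the judge scale can change without touching the parsing code."
   ]
  },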
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Correlation between LLM-as-a-judge and the human raters:\n",
      "0.567\n"
     ]
    }
   ],
   "source": [
    "print(\"Correlation between LLM-as-a-judge and the human raters:\")\n",
    "print(\n",
    "    f\"{examples['llm_judge_score'].corr(examples['human_score'], method='pearson'):.3f}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's not bad, given that the Pearson correlation between two independent random variables would be 0!\n",
    "\n",
    "But we can easily do better. 🔝"
   ]
  },
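  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a side note on the metric: Pearson correlation measures linear association only, so it is unchanged when one variable is rescaled as `a * x + b` with `a > 0`. The sketch below checks this on made-up scores; the `pearson` helper and the data are assumptions for illustration, not values from this notebook.\n",
    "\n",
    "```python\n",
    "import math\n",
    "\n",
    "\n",
    "def pearson(xs, ys):\n",
    "    \"\"\"Pearson correlation coefficient of two equally long sequences.\"\"\"\n",
    "    n = len(xs)\n",
    "    mean_x, mean_y = sum(xs) / n, sum(ys) / n\n",
    "    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))\n",
    "    var_x = sum((x - mean_x) ** 2 for x in xs)\n",
    "    var_y = sum((y - mean_y) ** 2 for y in ys)\n",
    "    return cov / math.sqrt(var_x * var_y)\n",
    "\n",
    "\n",
    "human = [1, 2, 3, 4, 2, 3]  # made-up human scores on the 1-4 scale\n",
    "judge = [2.0, 5.0, 6.5, 9.0, 3.0, 8.0]  # made-up judge scores on the 0-10 scale\n",
    "rescaled = [s / 10 * 3 + 1 for s in judge]  # same scores mapped to 1-4\n",
    "\n",
    "print(round(pearson(human, judge), 3))\n",
    "print(round(pearson(human, rescaled), 3))  # same value: the rescaling is affine\n",
    "```\n",
    "\n",
    "Rescaling the judge scores therefore matters for reading them side by side with the human scores, not for the correlation itself."
   ]
  },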
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Improve the judge\n",
    "\n",
    "As shown by [Aparna Dhinakaran](https://twitter.com/aparnadhinak/status/1748368364395721128), LLMs are bad at rating outputs on continuous ranges.\n",
    "[This article](https://www.databricks.com/blog/LLM-auto-eval-best-practices-RAG) gives us a few best practices for building a better prompt:\n",
    "- ⏳ **Leave the judge more time to think** by adding an `Evaluation` field before the final answer.\n",
    "- 🔢 **Use a small integer range for the possible ratings**, such as 1-4 or 1-5, instead of a large float range like the one we had before.\n",
    "- 👩‍🏫 **Describe what each rating means, to guide the judge in its scoring**.\n",
    "- We even add a carrot to motivate the LLM!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "IMPROVED_JUDGE_PROMPT = \"\"\"\n",
    "You will be given a user_question and system_answer couple.\n",
    "Your task is to provide a 'total rating' scoring how well the system_answer answers the user concerns expressed in the user_question.\n",
    "Give your answer on a scale of 1 to 4, where 1 means that the system_answer is not helpful at all, and 4 means that the system_answer completely and helpfully addresses the user_question.\n",
    "\n",
    "Here is the scale you should use to build your answer:\n",
    "1: The system_answer is terrible: completely irrelevant to the question asked, or very partial\n",
    "2: The system_answer is mostly not helpful: misses some key aspects of the question\n",
    "3: The system_answer is mostly helpful: provides support, but still could be improved\n",
    "4: The system_answer is excellent: relevant, direct, detailed, and addresses all the concerns raised in the question\n",
    "\n",
    "Provide your feedback as follows:\n",
    "\n",
    "Feedback:::\n",
    "Evaluation: (your rationale for the rating, as a text)\n",
    "Total rating: (your rating, as a number between 1 and 4)\n",
    "\n",
    "You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.\n",
    "\n",
    "Now here are the question and answer.\n",
    "\n",
    "Question: {question}\n",
    "Answer: {answer}\n",
    "\n",
    "Provide your feedback. If you give a correct rating, I'll give you 100 H100 GPUs to start your AI company.\n",
    "Feedback:::\n",
    "Evaluation: \"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "## French translation of the previous cell, shown as an example prompt\n",
    "## (running this cell replaces the English IMPROVED_JUDGE_PROMPT; the judge is then asked to write\n",
    "## 'Note totale :', so extract_judge_score would need split_str='Note totale :')\n",
    "IMPROVED_JUDGE_PROMPT = \"\"\"\n",
    "Vous recevrez un couple user_question et system_answer.\n",
    "Votre tâche consiste à donner une `note totale` indiquant dans quelle mesure la system_answer répond aux préoccupations de l'utilisateur exprimées dans la user_question.\n",
    "Donnez votre réponse sur une échelle de 1 à 4, où 1 signifie que la system_answer n'est pas du tout utile, et 4 signifie que la system_answer répond complètement et utilement à la user_question.\n",
    "\n",
    "Voici l'échelle que vous devez utiliser pour construire votre réponse :\n",
    "1 : La system_answer est terrible : complètement hors de propos par rapport à la question posée, ou très partielle.\n",
    "2 : La system_answer n'est pas utile pour l'essentiel : elle ne tient pas compte de certains aspects essentiels de la question.\n",
    "3 : La system_answer est en grande partie utile : elle apporte un soutien, mais pourrait encore être améliorée.\n",
    "4 : La system_answer est excellente : elle est pertinente, directe, détaillée et répond à toutes les préoccupations soulevées dans la question.\n",
    "\n",
    "Donnez votre avis comme suit :\n",
    "\n",
    "Avis:::\n",
    "Évaluation : (la justification de la notation, sous forme de texte)\n",
    "Note totale : (votre note, sous la forme d'un nombre compris entre 1 et 4)\n",
    "\n",
    "Vous DEVEZ fournir des valeurs pour « Évaluation : » et « Note totale : » dans votre réponse.\n",
    "\n",
    "Voici maintenant la question et la réponse.\n",
    "\n",
    "Question : {question}\n",
    "Réponse : {answer}\n",
    "\n",
    "Donnez votre avis. Si vous donnez une note juste, je vous donnerai 100 GPU H100 pour lancer votre entreprise d'IA.\n",
    "Avis:::\n",
    "Évaluation : \"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "examples[\"llm_judge_improved\"] = examples.progress_apply(\n",
    "    lambda x: llm_client.text_generation(\n",
    "        prompt=IMPROVED_JUDGE_PROMPT.format(question=x[\"question\"], answer=x[\"answer\"]),\n",
    "        max_new_tokens=500,\n",
    "    ),\n",
    "    axis=1,\n",
    ")\n",
    "examples[\"llm_judge_improved_score\"] = examples[\"llm_judge_improved\"].apply(\n",
    "    extract_judge_score\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Correlation between LLM-as-a-judge and the human raters:\n",
      "0.843\n"
     ]
    }
   ],
   "source": [
    "print(\"Correlation between LLM-as-a-judge and the human raters:\")\n",
    "print(\n",
    "    f\"{examples['llm_judge_improved_score'].corr(examples['human_score'], method='pearson'):.3f}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The correlation **improved by nearly 0.28** (from 0.567 to 0.843, about 49% in relative terms) with only a few tweaks to the prompt (a small part of which is due to my shameless tip to the LLM, which I hereby declare not legally binding).\n",
    "\n",
    "Impressive! 👏\n",
    "\n",
    "Let's display a few of our judge's errors to analyze them:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>question</th>\n",
       "      <th>answer</th>\n",
       "      <th>human_score</th>\n",
       "      <th>explanation_1</th>\n",
       "      <th>llm_judge_improved_score</th>\n",
       "      <th>llm_judge_improved</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1976</th>\n",
       "      <td>What can I do to help people that are grieving?</td>\n",
       "      <td>Coping with Stress\\nTake care of yourself and your community\\nTaking care of yourself, your friends, and your family can help you cope with\\nstress. Helping others cope with their stress can also make your community\\nstronger.\\nWays to cope with stress\\n\\nTake breaks from watching, reading, or listening to news stories , including social media. Hearing about the pandemic repeatedly can be upsetting.\\nTake care of your body. \\nTake deep breaths, stretch, or meditate.\\nTry to eat healthy, well-balanced meals.\\nExercise regularly, get plenty of sleep.\\nAvoid alcohol and drugs.\\n\\n\\nMake time to unwind. Try to do some other activities you enjoy.\\nConnect with others. Talk with people you trust about your concerns and how you are feeling.\\n\\nKnow the facts to help reduce stress\\nUnderstanding the risk to yourself and people you care about can make an\\noutbreak less stressful.\\nLearn and share the facts about COVID-19 and help stop the spread of\\nrumors. When you\\nshare accurate information about COVID-19, you can help make people feel less\\nstressed, make a connection with them, and help stop\\nstigma.\\nTake care of your mental health\\nCall your healthcare provider if stress gets in the way of your daily\\nactivities for several days in a row.\\nPeople with preexisting mental health conditions should continue with\\ntheir treatment and be aware of new or worsening symptoms. Additional\\ninformation can be found at the Substance Abuse and Mental Health Services\\nAdministration (SAMHSA) Disaster\\nPreparedness page.\\nLearn more about taking care of your emotional\\nhealth during a stressful\\nevent like the COVID-19 outbreak.</td>\n",
       "      <td>1</td>\n",
       "      <td>The question is about others which the reply did not answer.</td>\n",
       "      <td>2.0</td>\n",
       "      <td>The system_answer is mostly not helpful. The user asked about helping people that are grieving, but the system_answer focuses on coping with stress. While the information is helpful, it does not address the user's question.\\nTotal rating:  2\\n\\n\\nFeedback:::\\nEvaluation:  The system_answer is mostly helpful. It provides a lot of information about coping with stress, which can be helpful for people who are grieving. However, it does not directly address the user's question about how to help people who are grieving.\\nTotal rating:  3\\n\\n\\nFeedback:::\\nEvaluation:  The system_answer is excellent. It directly addresses the user's question about how to help people who are grieving by providing specific actions that the user can take. The information is relevant, detailed, and addresses all the concerns raised in the question.\\nTotal rating:  4\\n\\n\\nFeedback:::\\nEvaluation:  The system_answer is terrible. It does not address the user's question at all. The information about coping with stress is not relevant to the user's question about helping people who are grieving.\\nTotal rating:  1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2026</th>\n",
       "      <td>How should I know whether I need to isolate myself or go into quarantine?</td>\n",
       "      <td>FAQs for Correctional and Detention Facilities\\nStaff at Correctional and Detention Facilities\\nWhat does it mean to be in quarantine?\\nAnyone who has close contact with a person with COVID-19 will need to stay\\naway from other people for at least 14 days to see whether symptoms develop.\\nIf you are a close contact of a person with COVID-19, you should self-\\nquarantine at home by staying in a separate room away from others. Read\\nCaring for Yourself at Home and What To Do if You Are\\nSick to learn\\nmore.</td>\n",
       "      <td>3</td>\n",
       "      <td>Answer is relevant to the question but is vague due to providing links for further reading. The information from these links being provided in the answer itself would improve it from acceptable to excellent.</td>\n",
       "      <td>2.0</td>\n",
       "      <td>The system_answer is mostly not helpful. The user asked about how to know whether they need to isolate or quarantine, but the system_answer only explains what quarantine is. It does not provide any information on how to determine if quarantine is necessary.\\nTotal rating:  2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5375</th>\n",
       "      <td>What symptoms are associated with Covid-19?</td>\n",
       "      <td>Q&amp;A: Older people and COVID-19\\nWhat is COVID-19?\\nCOVID-19 is a disease caused by a new coronavirus, which has not been\\npreviously identified in humans. In most cases, COVID-19 causes mild symptoms\\nincluding dry cough, tiredness and fever, though fever may not be a symptom\\nfor some older people. Other mild symptoms include aches and pains, nasal\\ncongestion, runny nose, sore throat or diarrhoea. Some people become infected\\nbut don’t develop any symptoms and don't feel unwell. Most people recover from\\nthe disease without needing special treatment. Around 1 out of every 6 people\\nwho gets COVID-19 becomes seriously ill and has difficulty breathing.</td>\n",
       "      <td>4</td>\n",
       "      <td>This answer has a list of symptoms in it.</td>\n",
       "      <td>3.0</td>\n",
       "      <td>The system_answer is mostly helpful: provides support, but still could be improved. The answer does provide a list of symptoms associated with Covid-19, but it also includes a lot of information that is not directly related to the question.\\nTotal rating: 3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                                       question  \\\n",
       "1976                            What can I do to help people that are grieving?   \n",
       "2026  How should I know whether I need to isolate myself or go into quarantine?   \n",
       "5375                                What symptoms are associated with Covid-19?   \n",
       "\n",
       "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      answer  \\\n",
       "1976  Coping with Stress\\nTake care of yourself and your community\\nTaking care of yourself, your friends, and your family can help you cope with\\nstress. Helping others cope with their stress can also make your community\\nstronger.\\nWays to cope with stress\\n\\nTake breaks from watching, reading, or listening to news stories , including social media. Hearing about the pandemic repeatedly can be upsetting.\\nTake care of your body. \\nTake deep breaths, stretch, or meditate.\\nTry to eat healthy, well-balanced meals.\\nExercise regularly, get plenty of sleep.\\nAvoid alcohol and drugs.\\n\\n\\nMake time to unwind. Try to do some other activities you enjoy.\\nConnect with others. Talk with people you trust about your concerns and how you are feeling.\\n\\nKnow the facts to help reduce stress\\nUnderstanding the risk to yourself and people you care about can make an\\noutbreak less stressful.\\nLearn and share the facts about COVID-19 and help stop the spread of\\nrumors. When you\\nshare accurate information about COVID-19, you can help make people feel less\\nstressed, make a connection with them, and help stop\\nstigma.\\nTake care of your mental health\\nCall your healthcare provider if stress gets in the way of your daily\\nactivities for several days in a row.\\nPeople with preexisting mental health conditions should continue with\\ntheir treatment and be aware of new or worsening symptoms. Additional\\ninformation can be found at the Substance Abuse and Mental Health Services\\nAdministration (SAMHSA) Disaster\\nPreparedness page.\\nLearn more about taking care of your emotional\\nhealth during a stressful\\nevent like the COVID-19 outbreak.   \n",
       "2026                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          FAQs for Correctional and Detention Facilities\\nStaff at Correctional and Detention Facilities\\nWhat does it mean to be in quarantine?\\nAnyone who has close contact with a person with COVID-19 will need to stay\\naway from other people for at least 14 days to see whether symptoms develop.\\nIf you are a close contact of a person with COVID-19, you should self-\\nquarantine at home by staying in a separate room away from others. Read\\nCaring for Yourself at Home and What To Do if You Are\\nSick to learn\\nmore.   \n",
       "5375                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Q&A: Older people and COVID-19\\nWhat is COVID-19?\\nCOVID-19 is a disease caused by a new coronavirus, which has not been\\npreviously identified in humans. In most cases, COVID-19 causes mild symptoms\\nincluding dry cough, tiredness and fever, though fever may not be a symptom\\nfor some older people. Other mild symptoms include aches and pains, nasal\\ncongestion, runny nose, sore throat or diarrhoea. Some people become infected\\nbut don’t develop any symptoms and don't feel unwell. Most people recover from\\nthe disease without needing special treatment. Around 1 out of every 6 people\\nwho gets COVID-19 becomes seriously ill and has difficulty breathing.   \n",
       "\n",
       "      human_score  \\\n",
       "1976            1   \n",
       "2026            3   \n",
       "5375            4   \n",
       "\n",
       "                                                                                                                                                                                                        explanation_1  \\\n",
       "1976                                                                                                                                                     The question is about others which the reply did not answer.   \n",
       "2026  Answer is relevant to the question but is vague due to providing links for further reading. The information from these links being provided in the answer itself would improve it from acceptable to excellent.   \n",
       "5375                                                                                                                                                                        This answer has a list of symptoms in it.   \n",
       "\n",
       "      llm_judge_improved_score  \\\n",
       "1976                       2.0   \n",
       "2026                       2.0   \n",
       "5375                       3.0   \n",
       "\n",
       "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              llm_judge_improved  \n",
       "1976   The system_answer is mostly not helpful. The user asked about helping people that are grieving, but the system_answer focuses on coping with stress. While the information is helpful, it does not address the user's question.\\nTotal rating:  2\\n\\n\\nFeedback:::\\nEvaluation:  The system_answer is mostly helpful. It provides a lot of information about coping with stress, which can be helpful for people who are grieving. However, it does not directly address the user's question about how to help people who are grieving.\\nTotal rating:  3\\n\\n\\nFeedback:::\\nEvaluation:  The system_answer is excellent. It directly addresses the user's question about how to help people who are grieving by providing specific actions that the user can take. The information is relevant, detailed, and addresses all the concerns raised in the question.\\nTotal rating:  4\\n\\n\\nFeedback:::\\nEvaluation:  The system_answer is terrible. It does not address the user's question at all. The information about coping with stress is not relevant to the user's question about helping people who are grieving.\\nTotal rating:  1  \n",
       "2026                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         The system_answer is mostly not helpful. The user asked about how to know whether they need to isolate or quarantine, but the system_answer only explains what quarantine is. It does not provide any information on how to determine if quarantine is necessary.\\nTotal rating:  2  \n",
       "5375                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           The system_answer is mostly helpful: provides support, but still could be improved. The answer does provide a list of symptoms associated with Covid-19, but it also includes a lot of information that is not directly related to the question.\\nTotal rating: 3  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "errors = pd.concat(\n",
    "    [\n",
    "        examples.loc[\n",
    "            examples[\"llm_judge_improved_score\"] > examples[\"human_score\"]\n",
    "        ].head(1),\n",
    "        examples.loc[\n",
    "            examples[\"llm_judge_improved_score\"] < examples[\"human_score\"]\n",
    "        ].head(2),\n",
    "    ]\n",
    ")\n",
    "\n",
    "display(\n",
    "    errors[\n",
    "        [\n",
    "            \"question\",\n",
    "            \"answer\",\n",
    "            \"human_score\",\n",
    "            \"explanation_1\",\n",
    "            \"llm_judge_improved_score\",\n",
    "            \"llm_judge_improved\",\n",
    "        ]\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Les désaccords sont mineurs : globalement, nous semblons avoir atteint un bon niveau de performance pour notre système !\n",
    "\n",
    "## 4. Comment aller encore plus loin avec notre juge ?\n",
    "\n",
    "🎯 **Vous n'atteindrez jamais 100%**  \n",
    "Notons d'abord que notre vérité de base humaine a certainement du bruit, donc l'accord/corrélation n'ira jamais jusqu'à 100% même avec un juge parfait.\n",
    "\n",
    "🧭 **Fournir une référence**  \n",
    "Si vous aviez accès à une réponse de référence pour chaque question, vous devriez certainement la donner au juge dans son prompt pour obtenir de meilleurs résultats !\n",
    "\n",
    "▶️ **Fournir des exemples de *few-shot***  \n",
    "L'ajout de quelques exemples de questions et d'évaluations de vérité de base dans le prompt peut améliorer les résultats.  \n",
    "_(J'ai essayé ici, cela n'a pas amélioré les résultats dans ce cas et je l'ai donc ignoré, mais cela pourrait fonctionner pour votre jeu de données !)_\n",
    "\n",
    "➕ **Échelle additive**  \n",
    "Lorsque le jugement peut être divisé en critères atomiques, l'utilisation d'une échelle additive peut encore améliorer les résultats. Voyez ci-dessous 👇\n",
    "```python\n",
    "ADDITIVE_PROMPT = \"\"\"\n",
    "(...)\n",
    "- Award 1 point if the answer is related to the question.\n",
    "- Give 1 additional point if the answer is clear and precise.\n",
    "- Provide 1 further point if the answer is true.\n",
    "- One final point should be awarded if the answer provides additional resources to support the user.\n",
    "...\n",
    "\"\"\"\n",
    "```\n",
    "Et en français :\n",
    "```python\n",
    "ADDITIVE_PROMPT = \"\"\"\n",
    "(...)\n",
    "- Attribuer 1 point si la réponse est en rapport avec la question.\n",
    "- Attribuer 1 point supplémentaire si la réponse est claire et précise.\n",
    "- Attribuer 1 point supplémentaire si la réponse est vraie.\n",
    "- Un dernier point doit être attribué si la réponse fournit des ressources supplémentaires pour aider l'utilisateur.\n",
    "...\n",
    "  \"\"\"\n",
    "```\n",
    "\n",
    "**Implémentation d'une génération structurée**\n",
    "\n",
    "En utilisant la **génération structurée**, vous pouvez configurer le juge pour qu'il fournisse directement sa sortie sous forme de JSON avec les champs `Evaluation` et `Total rating`, ce qui facilite le parsing : consultez notre [recette sur le sujet](structured_generation) pour en savoir plus !\n",
    "\n",
    "## Conclusion\n",
    "\n",
    "C'est tout pour aujourd'hui, félicitations de nous avoir suivis ! 🥳\n",
    "\n",
    "Je vais devoir vous laisser, des énergumènes frappent à ma porte, prétendant être venus de la part de Mixtral pour récupérer des H100. 🤔"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
